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ABSTRACT 

Efficient Natural Evolution Strategies (eNES) is a novel al- 
ternative to conventional evolutionary algorithms, using the 
natural gradient to adapt the mutation distribution. Unlike 
previous methods based on natural gradients, eNES uses a 
fast algorithm to calculate the inverse of the exact Fisher in- 
formation matrix, thus increasing both robustness and per- 
formance of its evolution gradient estimation, even in higher 
dimensions. Additional novel aspects of eNES include opti- 
mal fitness baselines and importance mixing (a procedure for 
updating the population with very few fitness evaluations). 
The algorithm yields competitive results on both unimodal 
and multimodal benchmarks. 
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1. INTRODUCTION 

Evolutionary algorithms aim to optimize a 'fitness' func- 
tion that is either unknown or too complex to model di- 
rectly. They allow domain experts to search for good or 
near-optimal solutions to numerous difficult real- world prob- 
lems in areas ranging from medicine and finance to control 
and robotics. 

Typically, three objectives have to be kept in mind when 
developing evolutionary algorithms — we want (1) robust per- 
formance; (2) few (potentially costly) fitness evaluations; (3) 
scalability with problem dimensionality. 

We recently introduced Natural Evolution Strategies (NES; 
[8]), a new class of evolutionary algorithms less ad- hoc than 
traditional evolutionary methods. Here we propose a novel 
algorithm within this framework. It retains the theoretically 
well-founded nature of the original NES while addressing its 
shortcomings w.r.t. the above objectives. 
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NES algorithms maintain and iteratively update a multi- 
normal mutation distribution. Parameters are updated by 
estimating a natural evolution gradient, i.e. the natural gra- 
dient on the parameters of the mutation distribution, and 
following it towards better expected fitness. Well-known ad- 
vantages of natural gradient methods include isotropic con- 
vergence on ill-shaped fitness landscapes [2]. This avoids 
drawbacks of 'vanilla' (regular) gradients which arc prone 
to slow or premature convergence [4j. 

Our algorithm calculates the natural evolution gradient 
using the exact Fisher information matrix (FIM) and the 
Monte Carlo-estimated gradient. In conjunction with the 
techniques of optimal fitness baselines and fitness shaping 
this yields robust performance (objective 1). 

To reduce the number of potentially costly evaluations 
(objective 2), we introduce importance mixing, a kind of 
steady-state enforcer which keeps the distribution of the new 
population conformed to the current mutation distribution. 

To keep the computational cost manageable in higher prob- 
lem dimensions (objective 3), we derive a novel, efficient 
algorithm for computing the inverse of the exact Fisher in- 
formation matrix (previous methods were either inefficient 
or approximate). 

The resulting algorithm, Efficient Natural Evolution Strate- 
gies (eNES), is elegant, requires no additional heuristics and 
has few parameters that need tuning. It performs consis- 
tently well on both unimodal and multimodal benchmarks. 

2. EVOLUTION GRADIENTS 

First let us introduce the algorithm framework and the 
concept of evolution gradients. The objective is to maximize 
a d-dimensional unknown fitness function / : E d — > R, while 
keeping the number of function evaluations - which are con- 
sidered costly - as low as possible. The algorithm iteratively 
evaluates a population of size n individuals Zi . . . z n gener- 
ated from the mutation distribution p (z\0). It then uses the 
fitness evaluations /(zi) . . . f(z n ) to adjust parameters 9 of 
the mutation distribution. 

Let J (6) = E [/ (z) \9\ be the expected fitness under mu- 
tation distribution p(z\0), namely, 

J{0) = j f{z)p{z\9)dz. 

The core idea of our approach is to find, at each itera- 
tion, a small adjustment 89, such that the expected fitness 



J (9 + 88) is increased. The most straightforward approach 
is to set 88 oc VgJ (8), where Ve J (8) is the gradient on J (8). 
Using the 'log likelihood trick', the gradient can be written 
as 



V fl J(0) = V B j f(z)p(z\8)dz 
= f f(z) V e p(z\8)dz 



p(z|0)-(/(z)V e lnp(zj0))dz, 



The last term can be approximated using Monte Carlo: 

VeJ(0) = -V™ /(z,)V 9 lnp(z^), 
n — 'i=l 

where V|J (8) denotes the estimated evolution gradient. 

In our algorithm, we assume that p(z\8) is a Gaussian 
distribution with parameters 8 = (x, A) , where x represents 
the mean, and A represents the Cholesky decomposition of 
the covariance matrix C, such that A is upper triangular 
matrix ancQ C = A T A. The reason why we choose A in- 
stead of C as primary parameter is twofold. First, A makes 
explicit the d(d + 1) /2 independent parameters determin- 
ing the covariance matrix C. Second, the diagonal elements 
of A are the square roots of the eigenvalues of C, so A T A 
is always positive semidefinite. In the rest of the text, we 
assume 8 is column vector of dimension d s = d + d(d + 1) /2 
with elements in (x, A) arranged as 



Here 8° = x and 8 k — [dk,k ■ ■ ■ o*k,d] for 1 < k < d, where 
a i.j (i 5: j) denotes the (i, j)-th element of A. 
Now we compute 



g( z l' 



Velnp(z|6>) 
V e {^ln2^-iln|A| 2 



-i(A- T (z-x)) T (A- T (z-x))}, 



where g (z\8) is assumed to be a d s -dimensional column vec- 
tor. The gradient w.r.t. x is simply 

V x hip (z|0) = CT (z - x) . 

The gradient w.r.t. aij (i < j) is given by 

a 



do 



■lnp(z|0) =r i , j -S(i,j)ail, 



where nj is the (i,j)-th element of 

R = A~ T (z - x) (z - x) T C 

and 8 is the Kronecker Delta function. 

From g (z\9), the mutation gradient V| J (8) can be com- 
puted as V s g J(8) = iGf, where G = [g (zi \8) . . . g (z n \8)], 
and f = [/(zi) .../(z n )] T . We update 8 by 88 = r?V|J(0), 
where r\ is an empirically tuned step size. 



x For any matrix Q, Q denotes its inverse and Q T denotes 
its transpose. 



3. NATURAL GRADIENT 

Vanilla gradient methods have been shown to converge 
slowly in fitness landscapes with ridges and plateaus. Natu- 
ral gradients VU constitute a principled approach for dealing 
with such problems. The natural gradient, unlike the vanilla 
gradient, has the advantage of always pointing in the direc- 
tion of the steepest ascent. Furthermore, since the natural 
gradient is invariant w.r.t. the particular parameterization 
of the mutation distribution, it can cope with ill-shaped fit- 
ness landscapes and provides isotropic convergence proper- 
ties, which prevents premature convergence on plateaus and 
avoids overaggressive steps on ridges [I]. 

In this paper, we consider a special case of the natural 
gradient VgJ, defined as 

88 T V 9 J = m&xj (8 + 88) , 

69 

s.t. KL (8 + S8\\8) =e, 

where e is an arbitrarily small constant and KL(8'\\8) de- 
notes the Kullback-Leibler divergence between distributions 
p{z\8') and p{z\8). The constraints impose a geometry on 
8 which differs from the plain Euclidean one. With e — > 0, 
the natural gradient VgJ satisfies the necessary condition 
FVe J = Ve J, with F being the Fisher information matrix: 



E 



[v 9 lnp(z|(9)V s lnp(z|0) T ] 



If F is invertible, which may not always be the case, 
the natural gradient can be uniquely identified by VeJ = 
F~VeJ, or estimated from data using F~VgJ. The adjust- 
ment 88 can then be computed by 

88 = vF~V s oJ- 

In the following sub-sections, we show that the FIM can 
in fact be computed exactly, that it is invertible, and that 
there exists an efficienl|^] algorithm to compute the inverse 
of the FIM. 

3.1 Derivation of the Exact FIM 

In the original NES p], we compute the natural evolu- 
tion gradient using the empirical Fisher information matrix, 
which is estimated from the current population. This ap- 
proach has three important disadvantages. First, the em- 
pirical FIM is not guaranteed to be invertible, which could 
result in unstable estimations. Second, a large population 
size would be required to approximate the exact FIM up to a 
reasonable precision. Third, it is highly inefficient to invert 
the empirical FIM, a matrix with O (d 4 ) elements. 

We circumvent these problems by computing the exact 
FIM directly from mutation parameters 8, avoiding the po- 
tentially unstable and computationally costly method of es- 
timating the empirical FIM from a population which in turn 
was generated from 8. 

In eNES, the mutation distribution is the Gaussian de- 
fined by 8 = (x, A) , the precise FIM F can be computed 
analytically. Namely, the (m, n)-th element in F is given by 



(F) 



dx 1 



ax i 

W n + 2 tr 



dC PC 



where 8 m , 8 n denotes the m-th and n-th element in 8. Let 
im,jm be the di mt j m such that it appears at the (d + m)-th 



2 Normally the FIM would involve d 2 a = O (d 4 ) parameters, 
which is intractable for most practical problems. 



position in 9. First, notice that 



<9x T _ cbc 

dxi dxj 



and 



<9x T <9x 



9x T _ <9x 

9xi c>aj,fe 



0. 



So the upper left corner of the FIM is C , and F has the 
following shape 

C 
F A 

The next step is to compute Fa- Note that 



(F A ) m , n = =tr 



dC 



dC 



dai 



Using the relation 

dC d 



den, 



. T a dA T . . T OA 
A A =- A + A 



dai 



and the properties of the trace, we get 

OA 



(F A ) = tr 



da, t j ' 



OA 



da 



+ tr 



<9A 



9ai n ,j n 
dA T 1 



Here Dfe is the lower-right square submatrix of C with 
dimension d + 1 — k, e.g. Di = C , and Dd = (C~) d . 

We prove that the FIM given above is invertible if C is 
invertible. Fj, (1 < k < d) being invertible follows from the 
fact that the submatrix on the main diagonal of a pos- 
itive definite matrix C must also be positive definite, and 
adding a' k \ > to the diagonal would not decrease any of 

its eigenvalues. Also note that Fo = C is invertible, so F 
is invertible. 

It is worth pointing out that the block diagonal structure 
of F partitions parameters 9 into d + 1 orthogonal groups 
9° ...9 , which suggests that we could modify each group 
of parameters without affecting other groups. We will need 
this intuition in the next section. 

3.2 Iterative Computation of FIM Inverse 

The exact FIM is a block diagonal matrix with d+1 blocks. 
Normally, inverting the FIM requires d matrix inversions. 
However, we can explore the structure of each sub-block in 
order to make the inverse of F more efficient, both in terms 
of time and space complexity. 

First, we realize that Fd is simply a number, so its in- 
version is given by F d = ^(C~) dd + a dd j , and similarly 

D d = ^(C~) d . Now, letting k vary from d — 1 to 1, 

we can compute F^T and directly from ~D^+i- By block 
matrix inversion 



Computing the first term gives us 
OA . _ OA 



tr 



da t . 



us 


Pll 


P12 " 




Qr 


-PnPiaQj 




P21 


P22 _ 




Q 2 PaiPn 


Q2 



Note that since A is upper triangular, A is also upper 
triangular, so the first summand is non-zero iff 



In this case, (A ) 
dA 



tr 



— — J n — J 

. _ dA 



- - = a i i , so 



dai 



dai. 



Here S (•) is the generalized Kronecker Delta function, i.e. 
S (im,in,jm,jn) = 1 iff all four indices are the same. The 
second term is computed as 



using 

Qi = Pn P12P22P21, Q 2 = P22 P21PI1P12, 
and the Woodbury identity 

Q 2 - = [P 22 + P21 (-P n ) P 2 T i] " 

= P22 - P22P21 [-Pll + P2 r iP 2 2P2l] PJ1P22, 

(also noting that in our case, Pn is a number (C~) fc ), we 
can state 



Q 2 =P 2 



tr 



da 



dA dA T 



dai. 



( c ),• , S(i n ,i m ). 



(PggPgl) (P2" 2 P2l) T 
P21P22P2I — Pll 



Therefore, we have 

(F A ) mj „ = { C ~) jn ,j m S (in,im) +0-i£,iJ (im,in,jm,jn) ■ 

It can easily be proven that Fa itself is a block diagonal 
matrix with d blocks along the diagonal, with sizes ranging 
from d to 1. Therefore, the precise FIM is given by 



F = 



with Fo = C and block F fe (d > k > 1) given by 



This can be computed directly from P^, i-e. TJ^ +1 . Skipping 
the intermediate steps, we propose the following algorithm 
for computing F^ and from D^ +1 : 



k+l:d,k ■ 



= ( C ~)fe.fe>" = D 



/^ + l , 



= (C") 

= v T u, q = (w — s) _1 , qp = (wf — s)^ 1 
= -to -1 (1 + qs) , c F = -Wp 1 (1 + q F s) , 



D7 



Here (C~ 



IF 
CpU 1 

q 



D 



CpU T 

fc+1 + 9f uu 

T 1 



cu 

D fc+i + ^uu 



*M 




), , , , , is the sub- vector in C at column k, and 

v /fc + l:a,fc 

row k + 1 to d. A single update from Dj~ +1 to Ffe an d 
D^T requires O ((d — k) 2 ) floating point multiplications. The 



overall complexity of computing all sub-blocks F k , 1 < fc < 
d, is thus O (d 3 ) . 

The algorithm is efficient both in time and storage in the 
sense that, on one hand, there is no explicit matrix inversion, 
while on the other hand, the space for storing (including 
Ffe, if not needed later) can be reclaimed immediately after 
each iteration, which means that at most O (d 2 ) storage is 
required. Note also that FjjT can be used directly to compute 
89 k , using 89 k = F-G k f, where 

G fc = [g fc ( Zl ),...,g fc (z„)] 

= [V e *Inp(z|0),...,V fi *lnp(z|0)] 

is the submatrix of G w.r.t. the mutation gradient of 9 k . 

To conclude, the algorithm given above efficiently com- 
putes the inverse of the exact F1M, required for computing 
the natural mutation gradient. 

4. OPTIMAL FITNESS BASELINES 

The concept of fitness baselines, first introduced in [8], 
constitutes an efficient variance reduction method for esti- 
mating 89. However, baselines as found in [H] are suboptimal 
w.r.t. the variance of 89, since this F1M may not be invert- 
ible. ft is difficult to formulate the variance of 89 directly. 
However, since the exact FIM is invertible and can be com- 
puted efficiently, we can in fact compute an optimal baseline 
for minimizing the variance of 89, given by 

Var(<50) = 77 2 E[(F"VeJ-E [F"VeJ]) T 

•(F-VSJ-E[P-VSJ])], 

where VgJ is the estimated evolution gradient, which is 
given by 

V«J=-V" lf(zi)-b]Velnp(zi\0). 
n — '1=1 

The scalar b is called the fitness baseline. Adding b does not 
affect the expectation of Vg J, since 

E[V^J] = V e j (/(z)-6)p(z|0)dz 

= V B J /(z)p(z|0)dz. 

However, the variance depends on the value of b, i.e. 

Var(50) oc b 2 E [(F~Gl) T (F"Gl)] 

-2&E [(F~Gf) T (F"G1)] + const. 

Here 1 denotes a n-by-1 vector filled with Is. The optimal 
value of the baseline is 

E [(F-Gf) T (F-Gl)l 

b — — — . 

E [(F-G1) T (F-Gl)] 

Assuming individuals are i.i.d., b can be approximated from 
data by 

h ^_ E^i/W(F-g(zQ) T (F-g(z < )) 

E: =1 (F-g(z,)) T (F-g(z,)) 

In order to further reduce the estimation covariance, we 
can utilize a parameter-specific baseline for each parameter 



9j individually, which is given by 

= E [(hjGf) (h 3 Gl)] E^ 1 /(zQ(h J g(z t )) 2 
3 E [(h-,Gl) (hjGl)] ELi(h,gW) 2 ' 

Here hj is the j-th row vector of F . 

However, parameter-specific baseline values 9j might re- 
duce variance too much, which harms the performance of the 
algorithm. Additionally, adopting different baseline values 
for correlated parameters may affect the underlying struc- 
ture of the parameter space, rendering estimations unreli- 
able. To address both of these problems, we follow the in- 
tuition that if the (m, n)-th element in the FIM is 0, then 
parameters 9 m and 9 n are orthogonal and only weakly cor- 
related. Therefore, we propose using the block fitness base- 
line, i.e. a single baseline b k for each group of parameters 
, < k < d. Its formulation is given by 

b , = g [(^G fc f) (F-G fc l)] 
E[(F-G*1) (F-G^l)] 

^ E:U/(zQ(F fc V (zQ) T (F fc V (*Q) 

Er =1 (Ffeg fc ( Zl )) T (F fc -g fc W) 

Here F^" denotes the inverse of the fc-th diagonal block of 
F~, while G fc and g fc denote the submatrices corresponding 
to differentiation w.r.t. 9 k . 

In a companion paper [7], we empirically investigated the 
convergence properties when using the various types of base- 
line. We found block fitness baselines to be very robust, 
whereas uniform and parameter-specific baselines sometimes 
led to premature convergence. 

The main complexity for computing the optimal fitness 
baseline pertains to the necessity of storing a potentially 
large gradient matrix G, with dimension d s x n ~ O (nd 2 ). 
The time complexity, in this case, is O (nd 3 ) since we have 
to multiply each F;7 with G fe . For large problem dimen- 
sions, the storage requirement may not be acceptable since 
n also scales with d. We solve this problem by introducing 
a time-space tradeoff which reduces the storage requirement 
to O (d 2 ) while keeping the time complexity unchanged. 
In particular, we first compute for each k, a scalar Uk = 
aj~ (z — x), where a^T is the fc-th row vector of A~. Then, 
for I < i < n, we compute the vector v = (C~) fc . d (z — x), 
where (C - ), , is the submatrix of C~ by taking rows fc to 
d. The gradient g fc (Zj) can be computed from Uk and v, 
and used to compute F^g fc (zj) directly. The storage for 
g (zi) can be immediately reclaimed. Finally, the complex- 
ity of computing g fc (z^) for all i is 0(nd(d— fc)), so the 
total complexity of computing every element in G would 
still be O (nd 3 ). 

5. IMPORTANCE MIXING 

At each generation, we evaluate n new individuals gen- 
erated from mutation distribution p(z|0). However, since 
small updates ensure that the KL divergence between con- 
secutive mutation distributions is generally small, most new 
individuals will fall in the high density area of the previ- 
ous mutation distribution p(z|0'). This leads to redundant 
fitness evaluations in that same area. 

Our solution to this problem is a new procedure called 
importance mixing, which aims to reuse fitness evaluations 



min < 1, (1 — q) 



max < a, 1 



from the previous generation, while ensuring the updated 
population conforms to the new mutation distribution. 

Importance mixing works in two steps: In the first step, 
rejection sampling is performed on the previous population, 
such that individual z is accepted with probability 

P (z\e>) 

Here a G [0, 1] is the minimal refresh rate. Let n a be the 
number of individuals accepted in the first step. In the sec- 
ond step, reverse rejection sampling is performed as follows: 
Generate individuals from p(z\8) and accept z with proba- 
bility 

p(z\8) 

until n — n a new individuals are accepted. The n a individu- 
als from the old generation and n — n a newly accepted indi- 
viduals together constitute the new population. Note that 
only the fitnesses of the newly accepted individuals need 
to be evaluated. The advantage of using importance mix- 
ing is twofold: On the one hand, we reduce the number of 
fitness evaluations required in each generation, on the other 
hand, if we fix the number of newly evaluated fitnesses, then 
many more fitness evaluations can potentially be used to 
yield more reliable and accurate gradients. 

The minimal refresh rate a lower bounds the expected 
proportion of newly evaluated individuals p = E [ n ~™ a ] > 
namely p > a, with the equality holding iff 8 = 8'. In 
particular, if a = 1, all individuals from the previous gen- 
eration will be discarded, and if a = 0, p depends only 
on the distance between p(z\8) and p(z\0'). Normally we 
set a to be a small positive number, e.g. 0.01, to avoid 
too low an acceptance probability at the second step when 
p(z|0')/ P (z|0)~l. 

It can be proven that the updated population conforms 
to the mutation distribution p(z\8). In the region where 
(1 — a)p(z\8) /p(z\8') < 1, the probability that an individ- 
ual from previous generations appears in the new population 
is 

p {z\8') ■ (1 - a)p(z\8) lp (z\8') = (1 - a)p(z\0) . 

The probability that an individual generated from the sec- 
ond step entering the population is ap(z\8), since 

max {a, 1 — p (z\9) /p (z|#)} = a. 

So the probability of an individual entering the population 
is just p(z\8) in that region. The same result holds also for 
the region where (1 — a)p(z\8) /p(z\8') > 1. 

In a companion paper [7], we measured the usefulness of 
importance mixing, and found that it reduces the number 
of required fitness evaluations by a factor 5. Additionally, it 
reduced the algorithm's sensitivity to the population size. 

The computational complexity of importance mixing can 
be analyzed as follows. For each generated individual z, the 
probability p(z\8) and p(z\8') need to be evaluated, requir- 
ing O (d 2 ) computations. The total number of individuals 
generated is bounded by n/a in the worst case, and is close 
to n on average. 

6. FITNESS SHAPING 

For problems with wildly fluctuating fitnesses, the gradi- 
ent is disproportionately distorted by extreme fitness values, 



which can lead to premature convergence or numerical in- 
stability. To overcome this problem, we use fitness shaping, 
an order-preserving nonlinear fitness transformation func- 
tion [8]. The choice of (monotonically increasing) fitness 
shaping function is arbitrary, and should therefore be consid- 
ered to be one of the tuning parameters of the algorithm. We 
have empirically found that ranking-based shaping functions 
work best for various problems. The shaping function used 
for all experiments in this paper was fixed to /'(z) = 2i — 1 
for i > 0.5 and /'(z) = for i < 0.5, where i denotes the rel- 
ative rank of / (z) in the population, scaled between ... 1. 

7. EFFICIENT NES 

Integrating all the algorithm elements introduced above, 
the Efficient Natural Evolution Strategy (with block fitness 
baselines) can be summarized as 



1 
2 
3 
4 
5 
6 
7 
8 
9 

10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 



initialize A <— I 
repeat 

compute A - , and C~ = A~A~ T 

update population using importance mixing 

evaluate / (zj) for new Zj 

compute rank-based fitness shaping / 

for k = d to 

compute the exact FIM inverse Fj7 
u «— 0, v «— 0, si «— 0, s-2 <— 
for i = 1 to n 
q^F-g fe _(z0 
u «— u + / (zj) q 
v<-vlq 
Si <- Si + f (zj) q T q 
s 2 «- s 2 + q q 
end 

b k «- si/s 2 
88 k «- u - b k v 
end 

8 <- 8 + nS8 
until stopping criteria is met 



Note that vectors u and v in line 18 correspond to F k G k f 
and F ; TG fe l, respectively. Summing up the analysis from 
previous sections, the time complexity of processing a sin- 
gle generation is O (nd 3 ) , while the space complexity is just 
O (d 2 + nd) , where O (nd) comes from the need of storing 
the population. Assuming that n scales linearly with d, our 
algorithms scales linearly in space and quadratically in time 
w.r.t. the number of parameters, which is O (d 2 ) . This is a 
significant improvement over the original NES, whose com- 
plexity is O (d 4 ) in space and O (d 6 ) in time. 

Implementations of eNES are available in both Python 
and Matlat0 

8. EXPERIMENTS 

The tunable parameters of Efficient Natural Evolution 
Strategies are comprised of the population size n, the learn- 
ing rate 77, the refresh rate a and the fitness shaping function. 
In addition, three kinds of fitness baselines can be used. 

We empirically find a good and robust choice for the learn- 
ing rate n to be 1.0. On some (but not all) benchmarks the 

3 The Python code is part of the PyBrain machine learning 
library (www . pybrain . org) and the Matlab code is available 
at www.idsia.ch/~sun/enes.html 
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Figure 1: Results on the unimodal benchmark func- 
tions for dimension 5, 15 and 50 (from top to bot- 
tom). 



Figure 2: Success percentages varying with initial 
distances for the multimodal test functions using 
population sizes 20 and 100. 



performance can be further improved by more aggressive up- 
dates. Therefore, the only parameter that needs tuning in 
practice is the population size, which is dependent on both 
the expected ruggedness of the fitness landscape and the 
problem dimensionality. 

8.1 Benchmark Functions 

We empirically validate our algorithm on 9 unimodal and 
4 multimodal functions out of the set of standard bench- 
mark functions from [6] and [5], that are typically used in 
the literature, for comparison purposes and for competitions. 
We randomly choose the inital guess at average distance 1 
from the optimum. In order to prevent potentially biased 
results, we follow [6] and consistently transform (by a com- 
bined rotation and translation) the functions' inputs, mak- 
ing the variables non-separable and avoiding trivial optima 
(e.g. at the origin). This immediately renders many other 
methods virtually useless, since they cannot cope with corre- 
lated mutation directions. eNES, however, is invariant under 
translation and rotation. In addition, the rank-based fitness 
shaping makes it invariant under order-preserving transfor- 
mations of the fitness function. 

8.2 Performance on Benchmark Functions 

We ran eNES on the set of unimodal benchmark functions 
with dimensions 5, 15 and 50 with population sizes 50, 250 
and 1000, respectively, using rj = 1.0 and a target precision 
of 10 -10 . Figure 1 shows the average performance over 20 
runs (5 runs for dimension 50) for each benchmark function. 
We left out the Rosenbrock function on which eNES is one 
order of magnitude slower than on the other functions (e.g. 
150,000 evaluations on dimension 15). Presumably this is 
due to the fact that the principal mutation direction is up- 
dated too slowly on complex curvatures. Note that SharpR 
and ParabR are unbounded functions, which explains the 
abrupt drop-off. 

For the experiments on the multimodal benchmark func- 
tions we varied the distance of the initial guess to the opti- 
mum between 0.1 and 1000. Those runs were performed on 
dimension 2 with a target precision of 0.01, since here the 




Figure 3: Evolution path and mutation distributions 
for a typical run on Rastrigin. Ellipsoids correspond 
to 0.5 standard deviations of the mutation distribu- 
tions in generations 1, 20, 40. 

focus was on avoiding local maxima. We compare the re- 
sults for population size 20 and 100 (with r\ — 1.0). Figure 2 
shows, for all tested multimodal functions, the percentage of 
100 runs where eNES found the global optimum (as opposed 
to it getting stuck in a local extremum) conditioned on the 
distance from the initial guess to the optimum. 

Note that for Ackley and Griewank the success probabil- 
ity drops off sharply at a certain distance. For Ackley this 
is due to the fitness landscapes providing very little global 
structure to exploit, whereas for Giewank the reason is that 
the local optima are extremely large, which makes them vir- 
tually impossible to escape frorrrl Figure 3 shows the evo- 
lution path of a typical run on Rastrigin, and the ellipses 
corresponding to the mutation distribution at different gen- 
erations, illustrating how eNES jumps over local optima to 
reach the global optimum. 

For three functions we find that eNES finds the global 
optimum reliably, even with a population size as small as 
20. For the other one, Rastrigin, the global optimum is only 
reliably found when using a population size of 100. 

9. DISCUSSION 

Unlike most evolutionary algorithms, eNES boasts a rel- 
atively clean derivation from first principles. Using a full 
multinormal mutation distribution and fitness shaping, the 
eNES algorithm is invariant under translation and rotation 
and under order-preserving transformations of the fitness 
function. 

Comparing our empirical results to CMA-ES 13], consid- 
ered by many to be the 'industry standard' of evolutionary 
computation, we find that eNES is competitive but slower, 
especially on higher dimensions. However, eNES is faster 

4 A solution to this would be to start with a significantly 
larger initial C, instead of I 



on DiffPow for all dimensions. On multimodal benchmarks 
eNES is competitive with CMA-ES as well, as compared to 
the results in [8]. Our results collectively show that eNES 
can compete with state of the art evolutionary algorithms 
on standard benchmarks. 

Future work will also address the problems of automat- 
ically determining good population sizes and dynamically 
adapting the learning rate. Moreover, we plan to investi- 
gate the possibility of combining our algorithm with other 
methods (e.g. Estimation of Distribution Algorithms) to ac- 
celerate the adaptation of covariance matrices, improving 
performance on fitness landscapes where directions of ridges 
and valleys change abruptly (e.g. the Rosenbrock bench- 
mark) . 

10. CONCLUSION 

Efficient NES is a novel alternative to conventional evo- 
lutionary algorithms, using a natural evolution gradient to 
adapt the mutation distribution. Unlike previous natural 
gradient methods, eNES quickly calculates the inverse of the 
exact Fisher information matrix. This increases robustness 
and accuracy of the evolution gradient estimation, even in 
higher-dimensional search spaces. Importance mixing pre- 
vents unnecessary redundancy embodied by individuals from 
earlier generations. eNES constitutes a competitive, theoret- 
ically well-founded and relatively simple method for artificial 
evolution. Good results on standard benchmarks affirm the 
promise of this research direction. 
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